Abstract

A lumping of a Markov chain is a coordinate-wise projection of the
chain. We characterise the entropy rate loss of a lumping of an aperiodic
and irreducible Markov chain on a finite state space in two ways. First,
by the random growth rate of the number of realisable preimages of a
finite-length trajectory of the lumped chain. Second, by the possibility
to reconstruct original trajectories from their lumped images. Both are
purely combinatorial criteria, depending only on the transition graph of
the Markov chain and the lumping function. We state sufficient conditions
on a non-positive transition matrix and a lumping to preserve the entropy
rate. In the sparse setting, we give sufficient conditions on the lumping
to both preserve the entropy rate and result in a k-th order homogeneous
Markov chain. Every non-trivial lumping of a Markov chain with positive
transition matrix incurs an entropy rate loss.

1 Introduction

The entropy rate of a stationary stochastic process is the average number of bits 
per time unit needed to encode the process. A lumping of a (stationary) Markov 
chain is a coordinate-wise projection of the chain by a lumping function. The 
resulting (stationary) lumped stochastic process is also called a functional hidden 
Markov model [EM02]. One can transform an arbitrary hidden Markov model
on finite state and observation spaces into this setting [EM02, Section IV.E]. In
general, the lumped process loses the Markov property [GL05] and has a lower
entropy rate than the original Markov chain due to the aggregation of states
[Pin64, WA60].

The present paper investigates the structure of entropy rate preserving lump- 
ings of stationary Markov chains over a finite state space. The central result 
characterises the entropy rate preserving case in two ways. First, by a structural 
condition on the transition graph associated with the Markov chain. Second, by 
the growth of the number of realisable preimages of a realisation of the lumped 
process. We document a strong dichotomy between the preservation and loss 
case: uniform finite bounds on the lost entropy and the number of realisable 
preimages in the former and a linearly growing entropy loss and an almost- 
surely exponentially growing number of realisable preimages in the latter. 

In particular, a positive transition matrix always implies an entropy rate 
loss for a non-identity lumping. We state sufficient conditions on a lumping of a 
Markov chain with a sparse transition graph to preserve the entropy rate. The 
representation of each finite-state stationary stochastic process as a lumping of
a Markov chain on an at most countable state space by Carlyle [Car67] fulfils this
condition.

More refined sufficient conditions additionally yield higher-order Markov be- 
haviour of the lumped process. This behaviour is highly desirable from a sim- 
ulation point of view. The sufficient conditions for sparse transition matrices 
complement Gurvits & Ledoux's [GL05] result that lumpings of positive tran- 
sition matrices having higher-order Markov behaviour are nowhere dense. 

2 Main results 
2.1 Preliminaries 

We write $[n,m[\, := \{k \in \mathbb{Z} : n \le k < m\}$ and all obvious variations thereof. In
particular, we abbreviate $[n] := [1,n]$.

We briefly review the information-theoretic basics from Cover & Thomas [CT06].
Let $\operatorname{ld}$ denote the binary logarithm. By continuous extension we assume
$0 \operatorname{ld} 0 = 0$. The Shannon entropy [CT06, (2.1)] of a random variable (rv) $Z$ taking
values in a finite set $\mathcal{Z}$ is

\[ H(Z) := - \sum_{z \in \mathcal{Z}} \mathbb{P}(Z = z) \operatorname{ld} \mathbb{P}(Z = z) . \tag{1a} \]

The conditional entropy [CT06, (2.10)] of $Z$ given $W$ is defined by

\[ H(Z|W) := \sum_{w} \mathbb{P}(W = w) H(Z|W = w) . \tag{1b} \]

Successive conditioning reduces entropy [CT06, Theorem 2.6.5]:

\[ H(Z) \ge H(Z|W_1) \ge H(Z|W_1,W_2) . \tag{1c} \]

For a stationary stochastic process $Z := (Z_n)_{n \in \mathbb{Z}}$ on a finite state space $\mathcal{Z}$, the
entropy rate [CT06, Theorem 4.2.1, pp. 75] is

\[ \overline{H}(Z) := \lim_{n\to\infty} \frac{1}{n} H(Z_{[n]}) = \lim_{n\to\infty} H(Z_n|Z_{[n-1]}) . \tag{1d} \]

The left limit in (1d) is the limit of the normalised block entropy $H(Z_{[n]})$. By
stationarity and (1c) the $H(Z_n|Z_{[n-1]})$ in the right limit are monotonically
decreasing.
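The definitions (1a) and (1b) translate directly into a few lines of code. The following is a minimal sketch, assuming probability mass functions are given as plain Python lists; the function names `entropy` and `conditional_entropy` and the layout `joint[w][z]` are illustrative choices, not notation from the paper.

```python
from math import log2


def entropy(pmf):
    """Shannon entropy (1a) in bits; zero-mass terms contribute 0 (0 ld 0 = 0)."""
    return -sum(p * log2(p) for p in pmf if p > 0)


def conditional_entropy(joint):
    """Conditional entropy H(Z|W) as in (1b).

    joint[w][z] is assumed to hold P(W = w, Z = z).
    """
    total = 0.0
    for row in joint:
        pw = sum(row)  # marginal P(W = w)
        if pw > 0:
            total += pw * entropy([p / pw for p in row])
    return total
```

For the joint table `[[0.25, 0.25], [0.5, 0.0]]` the marginal of Z is `[0.75, 0.25]`, and the conditional entropy is smaller than the marginal entropy, in line with (1c).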



2.2 Setting 

This section describes the setting of our work. Let $X := (X_n)_{n \in \mathbb{Z}}$ be an
irreducible, aperiodic, homogeneous Markov chain on the finite state space $\mathcal{X}$. It
has transition matrix $P$ with invariant probability measure $\mu$. We assume that
$X$ is stationary, that is $X_0 \sim \mu$. The lumping function $g$ is $\mathcal{X} \to \mathcal{Y}$ and
surjective. We assume $g$ to be non-trivial, that is $2 \le |\mathcal{Y}| < |\mathcal{X}|$. Without loss of
generality, we extend $g$ to $\mathcal{X}^n \to \mathcal{Y}^n$, for arbitrary $n \in \mathbb{N}$. The lumped process
of $X$ under $g$ is the stochastic process $Y := (Y_n)_{n \in \mathbb{Z}}$ defined by $Y_n := g(X_n)$.
We refer to this setup also as the lumping $(P, g)$.

The lumping induces a conditional entropy rate [GK11, WA60], which
characterises the average information loss per time unit:

\[ \overline{H}(X|Y) := \lim_{n\to\infty} \frac{1}{n} H(X_{[n]}|Y_{[n]}) = \overline{H}(X) - \overline{H}(Y) . \tag{2} \]

Our main question is whether $\overline{H}(X|Y)$ is positive or zero. We speak of entropy
rate loss or entropy rate preservation respectively. Note that $\overline{H}(X|Y) = 0$ does
not imply that the original process can be reconstructed from the lumped
process (see figure 3 for an example).
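The first term on the right of (2) is easy to evaluate: for a stationary Markov chain the right limit in (1d) collapses to $H(X_1|X_0)$ by the Markov property, an identity the paper also uses in section 4.2. The sketch below computes it, assuming the transition matrix is a list of rows and approximating the invariant measure by power iteration; the names `stationary` and `markov_entropy_rate` and the iteration count are hypothetical choices made for illustration.

```python
from math import log2


def stationary(P, iters=10_000):
    """Approximate the invariant measure of P by power iteration.

    Assumes P is irreducible and aperiodic, so the iteration converges.
    """
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu


def markov_entropy_rate(P):
    """Entropy rate H(X_1 | X_0) in bits of the stationary chain with matrix P."""
    mu = stationary(P)
    return -sum(
        mu[i] * p * log2(p)
        for i, row in enumerate(P)
        for p in row
        if p > 0
    )
```

The second term, the entropy rate of the lumped process, has no comparably simple closed form in general, which is why the paper works with bounds and combinatorial criteria instead.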

Recall that the transition graph $G$ of the Markov chain $X$ is the directed
graph with vertex set $\mathcal{X}$ and an edge $(x,x')$, iff $P(x,x') > 0$. Our results partly
depend only on the structure of $G$, that is, whether an entry of $P$ is positive
or zero. In particular, if we want to investigate the number of realisable
preimages of a given finite-length lumped trajectory, then this number only depends
on the structure of $G$. Informally, multiple realisable preimages are intimately
connected with a loss of entropy when passing through the lumped trajectory
of finite length. To describe up to which length lumped trajectories have unique
realisable preimages, given knowledge of the original endpoints, we introduce a
key quantity of the lumping $(P,g)$:

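\[
\mathcal{K} := \inf\left\{ n \in \mathbb{N} :
\begin{array}{l}
\exists\, x, \hat{x} \in \mathcal{X},\ \mathbf{y} \in \mathcal{Y}^n : \exists\, \mathbf{x}' \neq \mathbf{x}'' \in g^{-1}(\mathbf{y}) \text{ such that both} \\
\mathbb{P}(X_0 = x, X_{[n]} = \mathbf{x}', X_{n+1} = \hat{x}) > 0 \\
\mathbb{P}(X_0 = x, X_{[n]} = \mathbf{x}'', X_{n+1} = \hat{x}) > 0
\end{array}
\right\} . \tag{3}
\]

Figure 1: (Colour online) An example of $\mathcal{K} = 3$ with two realisable length-5
trajectories $(x, x'_1, x'_2, x'_3, \hat{x})$ and $(x, x''_1, x''_2, x''_3, \hat{x})$ with the same lumped image
$(g(x), y_1, y_2, y_3, g(\hat{x}))$. The lumped states $\{g(x), y_1, y_2, y_3, g(\hat{x})\}$ do not have to
be distinct and may even all be the same.

Thus $\mathcal{K} \in \mathbb{N} \cup \{\infty\}$. Figure 1 on page 5 gives an example of $\mathcal{K} = 3$.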
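Since (3) only involves the positions of the positive entries of $P$, the quantity can be computed by a search on pairs of states. The following is a sketch under stated assumptions, not the authors' procedure: the graph is given as a matrix `A` whose entry `A[x][x2]` is truthy iff $P(x,x_2) > 0$, the lumping as a list `g`, and the function name `compute_K` is hypothetical. It relies on the observation that a shortest pair of trajectories witnessing (3) must differ at every interior position (otherwise cutting at a common state yields a shorter witness), and on irreducibility, which gives every state positive stationary mass.

```python
from collections import deque
from math import inf


def compute_K(A, g):
    """Length of the shortest 'bifurcate and join' pattern of (3), or inf.

    A[x][x2] is truthy iff P(x, x2) > 0; g[x] is the lumped state of x.
    """
    n = len(A)
    succ = [[j for j in range(n) if A[i][j]] for i in range(n)]

    def pairs_from(s1, s2):
        # Distinct states with the same lumped image, one from each side.
        return {(a, b) for a in s1 for b in s2 if a != b and g[a] == g[b]}

    # Pairs reachable in one step from a common starting state.
    start = set()
    for x in range(n):
        start |= pairs_from(succ[x], succ[x])

    dist = {p: 1 for p in start}
    queue = deque(start)
    best = inf
    while queue:
        a, b = queue.popleft()
        if set(succ[a]) & set(succ[b]):  # both branches can rejoin
            best = min(best, dist[(a, b)])
        for p in pairs_from(succ[a], succ[b]):
            if p not in dist:
                dist[p] = dist[(a, b)] + 1
                queue.append(p)
    return best
```

The search visits at most $|\mathcal{X}|^2$ pairs, so it terminates even when no witness exists, in which case it returns `inf`.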

2.3 The characterisation of entropy rate loss 

This section presents the characterisation of the entropy rate loss of a lumping
in terms of $\mathcal{K}$ and the preimage count. The preimage count of length $n$ of the
lumping $(P, g)$ is the random variable counting the realisable preimages of the
lumped trajectory of length $n$. Using Iverson brackets, this is

\[ T_n := \sum_{\mathbf{x} \in g^{-1}(Y_{[n]})} [\mathbb{P}(X_{[n]} = \mathbf{x}) > 0] . \tag{4} \]
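For a fixed lumped trajectory the count in (4) can be evaluated by a forward pass over the transition graph. This is a minimal sketch, assuming (as irreducibility guarantees) that every state has positive stationary mass, so that a trajectory is realisable exactly when consecutive states are joined by edges; the argument names `A`, `g`, `y` and the function name `preimage_count` are illustrative.

```python
def preimage_count(A, g, y):
    """Number of realisable preimages of the lumped trajectory y, cf. (4).

    A[x][x2] is truthy iff P(x, x2) > 0; g[x] is the lumped state of x;
    y is a non-empty sequence of lumped states.
    """
    n = len(A)
    # counts[x] = number of realisable preimages of the prefix ending in x
    counts = [1 if g[x] == y[0] else 0 for x in range(n)]
    for symbol in y[1:]:
        counts = [
            sum(counts[x] for x in range(n) if A[x][x2]) if g[x2] == symbol else 0
            for x2 in range(n)
        ]
    return sum(counts)
```

The pass costs $O(n |\mathcal{X}|^2)$ operations for a trajectory of length $n$, whereas enumerating $g^{-1}(\mathbf{y})$ directly may be exponential in $n$.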

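Our main result is
Theorem 1.

\begin{align*}
\overline{H}(X|Y) > 0 &\iff \mathcal{K} < \infty \iff \exists\, C > 1 : \mathbb{P}\Big(\liminf_{n\to\infty} \sqrt[n]{T_n} \ge C\Big) = 1 , \tag{5a} \\
\overline{H}(X|Y) = 0 &\iff \mathcal{K} = \infty \iff \exists\, C < \infty : \mathbb{P}\Big(\sup_{n\to\infty} T_n \le C\Big) = 1 . \tag{5b}
\end{align*}

The proofs of all statements in this section are in section 3. The constants
in theorem 1 are explicit functions of $(P,g)$; see (37) for (5a) and (15) for (5b).
Likewise, an explicit lower bound for the entropy rate loss in case (5a) is stated
in (34). Two further results, refining the dichotomy in theorem 1, are uniform
bounds on the conditional block entropies contingent on $\mathcal{K}$ and an upper bound
on $\mathcal{K}$ in the loss case.

Proposition 2. We have

\[ \forall\, n < \mathcal{K} + 1 : \quad H(X_{[n]}|Y_{[n]}) \le 2 \operatorname{ld}(|\mathcal{X}| - |\mathcal{Y}| + 1) . \tag{6} \]

Proposition 3. In case (5a), we have

\[ \mathcal{K} \le \sum_{y \in \mathcal{Y}} |g^{-1}(y)| \, (|g^{-1}(y)| - 1) . \tag{7} \]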

Theorem 1 and the following propositions reveal the following dichotomy in
the behaviour of the entropy of the lumping. If $\mathcal{K}$ is infinite, then arbitrarily long
trajectories of $X$ can be reconstructed from the lumped trajectory and knowledge
of the endpoints. Therefore, the only entropy loss occurs at those endpoints
and is finite. This yields uniform finite bounds on the conditional block entropies
and the preimage count. On the other hand, if $\mathcal{K}$ is finite (see figure 1), then at
least two realisable length-$(\mathcal{K}+2)$ trajectories of $X$ with the same lumped image
bifurcate and join. This situation leads to a finite entropy loss. Reasoning based
on the ergodic theorem ensures that this situation occurs linearly often in the
block length, thus leading to a linear growth of the conditional block entropy.
This translates into an entropy rate loss.

Another interesting and direct corollary is 
Corollary 4. If $P$ is positive, i.e., all its entries are positive, then $\overline{H}(X|Y) > 0$.
Proof. If $P$ is positive, then $\mathcal{K} = 1$. □

Thus, the search for entropy rate preserving lumpings must be in the space 
of sufficiently sparse transition matrices P. 

2.4 Sufficient conditions for entropy rate preservation and 
k-lumpability 

We present easy-to-check sufficient conditions for the preservation of the en- 
tropy rate. Their proofs are in section 4. Our conditions depend only on the 
transition graph $G$ and the lumping function $g$. The conditions are only stated
in a "forward form"; applying the conditions to the time-reversed setting yields
a set of mirrored sufficient conditions, which we omit.

Our first sufficient condition is 
Definition 5. A lumping $(P,g)$ is single entry (short: SE), iff

\[
\forall\, y \in \mathcal{Y}, x \in \mathcal{X} : \exists\, x' := x'(x,y) \in g^{-1}(y) : \forall\, x'' \in g^{-1}(y) \setminus \{x'\} :
\quad \mathbb{P}(X_1 = x'' | X_0 = x, Y_1 = y) = 0 . \tag{8}
\]

The class of single entry lumpings is entropy rate preserving:

Proposition 6. If $(P,g)$ is SE, then $\overline{H}(X|Y) = 0$.

Figure 2 on page 7 shows that SE is not necessary for entropy rate
preservation.

The case of the lumped process retaining the Markov property is desirable
from a computational and modelling point of view. However, in general, the
lumped process $Y$ does not possess the Markov property [KS76, GL05]. Its
distribution is more complex, despite living on a smaller set. Thus one may
hope that the lumped process belongs to the larger and still desirable class of
higher-order Markov chains.

Definition 7. A stochastic process $Z := (Z_n)_{n \in \mathbb{Z}}$ is a k-th order homogeneous
Markov chain (short: $Z$ is HMC(k)) iff

\[
\forall\, n \in \mathbb{Z}, k < m \in \mathbb{N}, z_n \in \mathcal{Z}, z_{[n-m,n-1]} \in \mathcal{Z}^m :
\quad \mathbb{P}(Z_n = z_n | Z_{[n-m,n[} = z_{[n-m,n[}) = \mathbb{P}(Z_n = z_n | Z_{[n-k,n[} = z_{[n-k,n[}) . \tag{9}
\]
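The single entry property of definition 5 only asks that, from each state, at most one edge of the transition graph leads into each preimage set $g^{-1}(y)$. A sketch of this check, assuming the graph is given as a matrix `A` with `A[x][x2]` truthy iff $P(x,x_2) > 0$ and the lumping as a list `g` (both names, and `is_single_entry`, are hypothetical):

```python
def is_single_entry(A, g):
    """Return True iff every state has at most one successor per lumped state."""
    n = len(A)
    for x in range(n):
        seen = set()  # lumped states already entered from x
        for x2 in range(n):
            if A[x][x2]:
                if g[x2] in seen:
                    return False  # two edges from x into the same g^{-1}(y)
                seen.add(g[x2])
    return True
```

Running the same function on the transposed matrix checks the mirrored condition for the time-reversed process.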
Figure 2: (Colour online) The transition graph of a Markov chain with the
lumping represented by red boxes. The lumping satisfies neither SE (violated by
transitions from $a$ into $B$) nor its mirror condition for the time-reversed process
(violated by transitions from $D$ to $f$). On the other hand, the existence of the
uniquely represented states $c_1$ and $c_2$ allows one to distinguish between the
trajectories $(a, b_1, c_1, d_1, f)$ and $(a, b_2, c_2, d_2, f)$. Therefore the lumping preserves
the entropy rate. Furthermore, this lumping is weakly 1-lumpable. Hence it
shows that SE is neither necessary for entropy rate preservation nor for weak
k-lumpability. This also applies to SFS(k) (see definition 9), which is a subclass
of SE.



Definition 8 (Extension of [KS76, Def. 6.3.1]). A lumping $(P,g)$ is weakly
k-lumpable (short: k-lump), iff $Y$ is HMC(k). If this holds for each
distribution of $X_0$, then we call $(P, g)$ strongly k-lumpable.

Definition 9. For $k \ge 2$, a lumping $(P,g)$ has the single forward k-sequence
property (short: SFS(k)), iff

\[
\begin{gathered}
\forall\, \mathbf{y} \in \mathcal{Y}^{k-1}, y \in \mathcal{Y} : \exists\, \mathbf{x}' := \mathbf{x}'(y, \mathbf{y}) \in g^{-1}(\mathbf{y}) :
\forall\, x \in g^{-1}(y), \mathbf{x} \in g^{-1}(\mathbf{y}) \setminus \{\mathbf{x}'\} : \\
\mathbb{P}(X_{[k-1]} = \mathbf{x} | Y_{[k-1]} = \mathbf{y}, X_0 = x) = 0 ,
\end{gathered}
\tag{10}
\]

i.e., iff observing a length-k lumped trajectory uniquely determines the last (k-1)
elements of its preimage.

The single forward k-sequence property implies entropy rate preservation
and k-lumpability:

Proposition 10. If $(P,g)$ is SFS(k), then it is k-lump and SE.

Figure 2 on page 7 shows that SFS(k) is not necessary for weak 1-lumpability
and entropy rate preservation. Figure 3 on page 8 demonstrates that SFS(2) is
neither necessary for SE nor for strong 1-lumpability. Finally, figure 4 on page
8 shows that SE implies neither SFS(k) nor k-lump, for every k.

2.5 Further discussion 

Functions of Markov chains have been considered in the literature for a long
time. In particular, whether the function of a Markov chain possesses the Markov
property or not has been investigated in, among others, [BR58, RP81]. Kemeny
& Snell [KS76] coined the term lumpability for this case. Higher-order
lumpability, as we use it in this work, has been analysed by Gurvits & Ledoux [GL05],
and they showed that the class of Markov chains being lumpable is nowhere
dense.

Figure 3: (Colour online) The transition graph of a Markov chain with the
lumping represented by red boxes. The lumping is SE and thus preserves the entropy
rate. Furthermore, it is strongly 1-lumpable and thus $H(Y_1|X_0) = H(Y_1|Y_0)$
(see proposition 15). However, observing an arbitrarily long trajectory of the
lumped process does not determine the current preimage state. Whence $(P, g)$
is not SFS(k), for every k. Therefore, SFS(k) is not necessary for entropy rate
preservation and strong lumpability.

Figure 4: (Colour online) The transition graph of a Markov chain with the
lumping represented by red boxes. The lumping is SE. The loops at $b_1$ and $b_2$
imply that the lumped process is not HMC(k), for every k. This is easily seen by
the inability to differentiate between $n$ consecutive $b_1$'s and $n$ consecutive $b_2$'s.
When starting in $B$ and as long as $P(b_1,a) \neq P(b_2,a)$, this long sequence of $B$'s
prevents determining the probability of entering $A$. Thus this is neither SFS(k)
nor k-lump, for each k.

A related problem is the identification problem, initially posed by Blackwell
& Koopmans [BK57] (for a recent treatment see [And99]): given a stationary
process on a finite state space, is it representable by a lumping of a Markov
chain? There are two results from research into this topic having a connection
to the present work.

First, Carlyle [Car67] showed that every stationary stochastic process on a
finite state space can be represented as a lumping of a Markov chain on an at
most countable state space. If the representation involves a Markov chain on a
finite state space, then it is SE and proposition 6 guarantees entropy rate
preservation of the representation.

Second, Gilbert [Gil59] showed that the distribution of a lumping of a
finite-state Markov chain is uniquely determined by the distribution of $m$ consecutive
samples, where $m$ depends on the cardinalities of the input and output
alphabet. This does not contradict the nowhere dense result of Gurvits & Ledoux,
however, since the construction of the process distribution is different from a
product of conditional distributions (as it is in the case of lumpability).

Moreover, the nowhere dense property does not prevent our results from 
being practically relevant. In particular, our sufficient condition holds for com- 
plete lower-dimensional subspaces of the space of Markov transition matrices. 
In other words, if the transition matrix is sufficiently sparse, one can hope 
that the lumping satisfies some of our sufficient conditions. More generally, 
one can hope that for a given Markov model there exists a lumping func- 
tion with a desired output alphabet such that the resulting lumping satisfies 
our sufficient conditions. Sparse transition matrices are used, e.g., in n-gram
models in automatic speech recognition [BdM+92, Table 1], chemical reaction
networks [HMMW10, HRSS10, Wil11] and link prediction and path
analysis [Sar00].

Although we treat only the stationary case, we expect that our results are 
generalisable to the case where the Markov chain is time-homogeneous, irre- 
ducible and aperiodic, but does not start in equilibrium, i.e., where the initial 
distribution does not coincide with the invariant measure. Entropy rates exist 
for a larger class of non-stationary stochastic processes, namely those station- 
ary in the asymptotic mean [KR81, Gra90] (short: AMS). Finite state Markov 
chains and lumpings thereof are AMS [KR81]. A more difficult question would 
be if aperiodicity could be dropped. 
3 Proof of the characterisation

Proof of theorem 1. The proofs of the preservation case (5b) of theorem 1 and
of proposition 2 are in section 3.1. The proofs of the loss case (5a) of theorem
1 and of proposition 3 are in section 3.4. Sections 3.2 and 3.3 contain technical
groundwork and results about Markov chains needed in the proof of the loss
case in section 3.4.

The statement in (5) follows from the mutually exhaustive implications

\begin{align*}
\mathcal{K} < \infty &\;\Rightarrow\; \overline{H}(X|Y) > 0 , \tag{11a} \\
\mathcal{K} = \infty &\;\Rightarrow\; \overline{H}(X|Y) = 0 \tag{11b}
\end{align*}

and

\begin{align*}
\mathcal{K} < \infty &\;\Rightarrow\; \exists\, C > 1 : \mathbb{P}\Big(\liminf_{n\to\infty} \sqrt[n]{T_n} \ge C\Big) = 1 , \tag{12a} \\
\mathcal{K} = \infty &\;\Rightarrow\; \exists\, C > 0 : \mathbb{P}\Big(\sup_{n\to\infty} T_n \le C\Big) = 1 . \tag{12b}
\end{align*}

The proofs of implications (11b) and (12b) are in section 3.1 and the proofs
of implications (11a) and (12a) are in section 3.4.

□

3.1 The preservation case 

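The definition of $\mathcal{K}$ in (3) implies that lumped trajectories of length less than
$\mathcal{K}$ have a unique preimage contingent on the endpoints, i.e., if $n < \mathcal{K}$, then

\[
\mathbb{P}(X_0 = x, Y_{[n]} = \mathbf{y}, X_{n+1} = \hat{x}) > 0
\;\Rightarrow\; \exists!\, \mathbf{x} \in \mathcal{X}^n : \mathbb{P}(X_{[n]} = \mathbf{x} | X_0 = x, Y_{[n]} = \mathbf{y}, X_{n+1} = \hat{x}) = 1 . \tag{13}
\]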

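Proof of proposition 2. Recall that we assume $n < \mathcal{K} + 1$. The unique preimage
(13) implies that the conditional entropy of the interior of a block, given its
lumped image and the states at its ends, is zero:

\[
H(X_{[2,n-1]}|X_1, X_n, Y_{[n]})
= \sum_{x,\hat{x} \in \mathcal{X},\, \mathbf{y} \in \mathcal{Y}^n} \mathbb{P}(X_1 = x, X_n = \hat{x}, Y_{[n]} = \mathbf{y})
\underbrace{H(X_{[2,n-1]}|X_1 = x, X_n = \hat{x}, Y_{[n]} = \mathbf{y})}_{= 0 \text{ by (13)}}
= 0 . \tag{14}
\]

We apply the chain rule of entropy (cf. [CT06, pp. 22]) to decompose the
conditional block entropy into its interior and its boundary. The interior vanishes by
(14) and the entropy at the endpoints is maximal for the uniform distribution:

\begin{align*}
H(X_{[n]}|Y_{[n]}) &= H(X_{[2,n-1]}|X_1, X_n, Y_{[n]}) + H(X_1, X_n|Y_{[n]}) \\
&\le 0 + H(X_1, X_n|Y_1, Y_n) \\
&\le 2 H(X_1|Y_1) \\
&\le 2 \max\{\operatorname{ld} |g^{-1}(y)| : y \in \mathcal{Y}\} \\
&\le 2 \operatorname{ld}(|\mathcal{X}| - |\mathcal{Y}| + 1) .
\end{align*}

The last inequality holds because $g$ is surjective: each of the other $|\mathcal{Y}| - 1$
lumped states has at least one preimage, so $|g^{-1}(y)| \le |\mathcal{X}| - |\mathcal{Y}| + 1$.

□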

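Proof of (11b). As $\mathcal{K} = \infty$, the bound from (6) holds uniformly. Thus

\[
\overline{H}(X|Y) = \lim_{n\to\infty} \frac{1}{n} H(X_{[n]}|Y_{[n]}) \le \lim_{n\to\infty} \frac{1}{n}\, 2 \operatorname{ld}(|\mathcal{X}| - |\mathcal{Y}| + 1) = 0 .
\]

□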

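Proof of (12b). Recall that we assume $\mathcal{K} = \infty$. We show that, for $\mathbf{y} \in \mathcal{Y}^n$ with
$\mathbb{P}(Y_{[n]} = \mathbf{y}) > 0$, we have

\[ \mathbb{P}(T_n \le (|\mathcal{X}| - |\mathcal{Y}| + 1)^2 \,|\, Y_{[n]} = \mathbf{y}) = 1 . \tag{15} \]

This implies (12b). To show (15), we use (13) to bound

\begin{align*}
&\sum_{\mathbf{x} \in g^{-1}(\mathbf{y})} [\mathbb{P}(X_{[n]} = \mathbf{x}) > 0] \\
&\quad = \sum_{x_1, x_n \in g^{-1}(\mathbf{y}_{\{1,n\}})} [\mathbb{P}(X_1 = x_1, X_n = x_n | Y_{[n]} = \mathbf{y}) > 0] \\
&\qquad \times \sum_{\mathbf{x} \in g^{-1}(\mathbf{y}_{[2,n-1]})} [\mathbb{P}(X_{[2,n-1]} = \mathbf{x} | X_1 = x_1, X_n = x_n, Y_{[2,n-1]} = \mathbf{y}_{[2,n-1]}) > 0] \\
&\quad \le \sum_{x_1, x_n \in g^{-1}(\mathbf{y}_{\{1,n\}})} [\mathbb{P}(X_1 = x_1, X_n = x_n | Y_{[n]} = \mathbf{y}) > 0] \\
&\quad \le |g^{-1}(\mathbf{y}_{\{1,n\}})| \le (|\mathcal{X}| - |\mathcal{Y}| + 1)^2 .
\end{align*}

□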

3.2 Non-overlapping traversal instants 

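The main result of this section, proposition 11, is an almost-sure linear lower
growth bound for non-overlapping occurrences of a fixed, finite pattern in a
realisation.

Let $Z := (Z_n)_{n \in \mathbb{N}}$ be a stationary stochastic process taking values in $\mathcal{Z}$. The
occupation instants of a state $z$ is the set of indices

\[ \mathcal{O}^Z_z(n) := \{i \in [n] : Z_i = z\} . \tag{16a} \]

The classic occupation time [Par99, section 6.4] is the cardinality of the
occupation instants. The traversal instants of a sequence $\mathbf{z} \in \mathcal{Z}^k$ is the set of indices

\[ \mathcal{T}^Z_{\mathbf{z}}(n) := \{i \in [n-k+1] : Z_{[i,i+k-1]} = \mathbf{z}\} . \tag{16b} \]

The non-overlapping traversal instants of a sequence $\mathbf{z} \in \mathcal{Z}^k$ is the set of indices

\[
\mathcal{N}^Z_{\mathbf{z}}(n) := \left\{ i \in [n-k+1] :
\begin{array}{l}
Z_{[i,i+k-1]} = \mathbf{z} \text{ and} \\
\forall\, j \in [i+1, i+k-1] : Z_{[j,j+k-1]} \neq \mathbf{z}
\end{array}
\right\} , \tag{16c}
\]

where we select lower indices greedily.

For $k \in \mathbb{N}$, the k-transition process $Z^{(k)}$ of $Z$ is the stochastic process on
$\mathcal{Z}^k$ with marginals

\[ \mathbb{P}(Z^{(k)}_{[n]} = (\mathbf{z}^i)_{i=1}^{n}) = \mathbb{P}(Z_{[n-1]} = (\mathbf{z}^i_{\{1\}})_{i=1}^{n-1}, Z_{[n,n+k-1]} = \mathbf{z}^n) , \tag{17} \]

if $\forall\, i \in [n-1] : \mathbf{z}^i_{[2,k]} = \mathbf{z}^{i+1}_{[k-1]}$, and zero else. Obvious relations are

\[ \mathcal{T}^Z_{\mathbf{z}}(n) = \mathcal{O}^{Z^{(k)}}_{\mathbf{z}}(n-k) \tag{18a} \]

and

\[ \mathcal{N}^Z_{\mathbf{z}}(n) \subseteq \mathcal{T}^Z_{\mathbf{z}}(n) \quad\text{with}\quad |\mathcal{N}^Z_{\mathbf{z}}(n)| \ge \frac{1}{k} |\mathcal{T}^Z_{\mathbf{z}}(n)| . \tag{18b} \]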
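The greedy selection behind the non-overlapping traversal instants is easiest to see in code. The sketch below uses 0-based indices instead of the paper's 1-based ones, and the function names `traversal_instants` and `non_overlapping_instants` are illustrative; it scans the occurrences of the pattern from left to right and keeps an occurrence whenever it starts after the previously kept one has ended.

```python
def traversal_instants(z, pattern):
    """All start indices (0-based) at which pattern occurs in z, cf. (16b)."""
    k = len(pattern)
    return [
        i for i in range(len(z) - k + 1)
        if tuple(z[i:i + k]) == tuple(pattern)
    ]


def non_overlapping_instants(z, pattern):
    """Greedy left-to-right choice of pairwise non-overlapping occurrences."""
    k = len(pattern)
    chosen, next_free = [], 0
    for i in traversal_instants(z, pattern):
        if i >= next_free:       # occurrence starts after the last kept one ends
            chosen.append(i)
            next_free = i + k
    return chosen
```

Each kept occurrence can rule out at most $k - 1$ later ones, which is the counting argument behind the cardinality bound in (18b).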
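Proposition 11. Let $\mathbf{s} \in \mathcal{X}^k$ with $p := \mathbb{P}(X_{[k]} = \mathbf{s}) > 0$. Then

\[ \mathbb{P}\Big( \liminf_{n\to\infty} \frac{1}{n} |\mathcal{N}^X_{\mathbf{s}}(n)| \ge \frac{p}{k} \Big) = 1 \tag{19a} \]

and

\[ \forall\, \varepsilon > 0 : \quad \lim_{n\to\infty} \mathbb{P}\Big( \frac{1}{n} |\mathcal{N}^X_{\mathbf{s}}(n)| \ge \frac{p}{k} - \varepsilon \Big) = 1 . \tag{19b} \]

Lemma 12 (Ergodic theorem [Woe09, theorem 3.55 on page 69]). For every
homogeneous, irreducible and aperiodic Markov chain $Z := (Z_n)_{n \in \mathbb{N}}$ on a
finite state space $\mathcal{Z}$ with invariant measure $\nu$, all $f : \mathcal{Z} \to \mathbb{R}$ and each starting
distribution $\alpha \in \mathcal{M}_1(\mathcal{Z})$ of $Z_1$, we have

\[ \mathbb{P}_\alpha\Big( \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} f(Z_i) = \int f(z)\, d\nu(z) =: \nu(f) \Big) = 1 . \tag{20} \]

Proof of proposition 11. Statement (19b) is a direct consequence of (19a).
The k-transition process $X^{(k)}$ is a Markov chain with transition probabilities

\[
P^{(k)}(\mathbf{x}, \mathbf{x}') :=
\begin{cases}
P(\mathbf{x}_{\{k\}}, \mathbf{x}'_{\{k\}}) & \text{if } \mathbf{x}_{[2,k]} = \mathbf{x}'_{[k-1]} , \\
0 & \text{else.}
\end{cases}
\tag{21}
\]

Furthermore, as $X$ is irreducible and aperiodic, then so is $X^{(k)}$. Its invariant
measure fulfils $\mu^{(k)}(\mathbf{x}) = \mu(\mathbf{x}_{\{1\}}) \prod_{i=1}^{k-1} P(\mathbf{x}_{\{i\}}, \mathbf{x}_{\{i+1\}})$.

Let $f$ be the indicator function of $\mathbf{s}$. We use (18) and lemma 12 to derive

\begin{align*}
\mathbb{P}\Big( \liminf_{n\to\infty} \frac{1}{n} |\mathcal{N}^X_{\mathbf{s}}(n)| \ge \frac{\mu^{(k)}(f)}{k} \Big)
&\ge \mathbb{P}\Big( \liminf_{n\to\infty} \frac{1}{n} |\mathcal{T}^X_{\mathbf{s}}(n)| \ge \mu^{(k)}(f) \Big) \\
&= \mathbb{P}\Big( \liminf_{n\to\infty} \frac{1}{n} |\mathcal{O}^{X^{(k)}}_{\mathbf{s}}(n-k)| \ge \mu^{(k)}(f) \Big) \\
&= \mathbb{P}\Big( \lim_{n\to\infty} \frac{1}{n} |\mathcal{O}^{X^{(k)}}_{\mathbf{s}}(n)| \ge \mu^{(k)}(f) \Big) \\
&= 1 .
\end{align*}

Finally $\mu^{(k)}(f) = \mu^{(k)}(\mathbf{s}) = \mathbb{P}(X_{[k]} = \mathbf{s}) = p$. □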
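3.3 Conditional Markov property

This section presents two technical statements about discrete Markov processes.
Let $X := (X_n)_{n \in \mathbb{Z}}$ be a stochastic process on the Cartesian product
$S := \prod_{n \in \mathbb{Z}} S_n$ of the finite sets $(S_n)_{n \in \mathbb{Z}}$. For $A \subseteq \mathbb{Z}$ let $S_A := \prod_{n \in A} S_n$. For the
remainder of this section, we assume that all conditional probabilities are
well-defined. The process $X$ is Markov, iff

\[
\forall\, n \in \mathbb{Z}, m \in \mathbb{N}, s_n \in S_n, s_{[n-m,n-1]} \in S_{[n-m,n-1]} :
\quad \mathbb{P}(X_n = s_n | X_{[n-m,n-1]} = s_{[n-m,n-1]}) = \mathbb{P}(X_n = s_n | X_{n-1} = s_{n-1}) . \tag{22}
\]

We denote by $A \Subset \mathbb{Z}$ the fact that $A$ is a finite subset of $\mathbb{Z}$. The first statement
is a factorisation of conditional probabilities over disjoint index blocks:

\[
\begin{gathered}
\forall\, m \in \mathbb{N},\ \emptyset \neq A_1, \dots, A_m, B_1, \dots, B_{m-1} \Subset \mathbb{Z},\ B_0, B_m \subseteq \mathbb{Z}, \\
A \cap B = \emptyset \text{ where } A := \biguplus_{i=1}^{m} A_i \text{ and } B := \biguplus_{i=0}^{m} B_i,\ x_A \in S_A,\ x_B \in S_B, \\
(\forall\, i \in [m] : b_i^- := \sup(B_{i-1}),\ b_i^+ := \inf(B_i),\ A_i \subseteq\, ]b_i^-, b_i^+[\,) : \\
\mathbb{P}(X_A = x_A | X_B = x_B) = \prod_{i=1}^{m} \mathbb{P}(X_{A_i} = x_{A_i} | X_{b_i^-} = x_{b_i^-}, X_{b_i^+} = x_{b_i^+}) .
\end{gathered}
\tag{23}
\]

Secondly, a Markov process retains the Markov property under a Cartesian
conditioning:

\[
\forall\, \emptyset \neq C \Subset \mathbb{Z},\ \widetilde{S}_C := \prod_{n \in C} \widetilde{S}_n \text{ with } \widetilde{S}_n \subseteq S_n :
\quad (X | X_C \in \widetilde{S}_C) \text{ is Markov.} \tag{24}
\]

Proof. We need the intermediate statements

\[
\forall\, n \in \mathbb{Z},\ \emptyset \neq B \Subset\, ]-\infty, n[,\ x_n \in S_n,\ x_B \in S_B :
\quad \mathbb{P}(X_n = x_n | X_B = x_B) = \mathbb{P}(X_n = x_n | X_{\max(B)} = x_{\max(B)}) \tag{25}
\]

and

\[
\begin{gathered}
\forall\, \emptyset \neq A, B \Subset \mathbb{Z},\ \max(B) < \min(A),\ x_A \in S_A,\ \widetilde{S}_B \subseteq \{x_{\max(B)}\} \times S_{B \setminus \max(B)} : \\
\mathbb{P}(X_A = x_A | X_B \in \widetilde{S}_B) = \mathbb{P}(X_A = x_A | X_{\max(B)} = x_{\max(B)}) .
\end{gathered}
\tag{26}
\]

Proof of (25): Let $C :=\, ]\max(B), n[$ and $D := [\min(B), \max(B)] \setminus B$. We use
(22) to get

\begin{align*}
\mathbb{P}(X_n = x_n | X_B = x_B)
&= \frac{\sum_{x_C, x_D} \mathbb{P}(X_n = x_n, X_C = x_C, X_B = x_B, X_D = x_D)}{\mathbb{P}(X_B = x_B)} \\
&= \frac{\sum_{x_C, x_D} \mathbb{P}(X_n = x_n, X_C = x_C | X_B = x_B, X_D = x_D)\, \mathbb{P}(X_B = x_B, X_D = x_D)}{\mathbb{P}(X_B = x_B)} \\
&= \frac{\sum_{x_C} \mathbb{P}(X_n = x_n, X_C = x_C | X_{\max(B)} = x_{\max(B)}) \sum_{x_D} \mathbb{P}(X_B = x_B, X_D = x_D)}{\mathbb{P}(X_B = x_B)} \\
&= \mathbb{P}(X_n = x_n | X_{\max(B)} = x_{\max(B)}) .
\end{align*}

Proof of (23): For $A \Subset \mathbb{Z}$, we abbreviate the event $E_A := \{X_A = x_A\}$. Apply
(25) to get

\begin{align*}
\mathbb{P}(X_A = x_A | X_B = x_B)
&= \frac{\mathbb{P}(X_A = x_A, X_B = x_B)}{\mathbb{P}(X_B = x_B)} \\
&= \frac{\mathbb{P}(E_{B_0}) \prod_{i=1}^{m} \mathbb{P}(E_{B_i \setminus \{b_i^+\}} | (E_{A_j})_{j \le i}, (E_{B_j})_{j<i}, E_{b_i^+})\, \mathbb{P}(E_{b_i^+}, E_{A_i} | (E_{A_j})_{j<i}, (E_{B_j})_{j<i})}
{\mathbb{P}(E_{B_0}) \prod_{i=1}^{m} \mathbb{P}(E_{B_i \setminus \{b_i^+\}} | (E_{B_j})_{j<i}, E_{b_i^+})\, \mathbb{P}(E_{b_i^+} | (E_{B_j})_{j<i})} \\
&= \prod_{i=1}^{m} \frac{\mathbb{P}(E_{B_i \setminus \{b_i^+\}} | E_{b_i^+})\, \mathbb{P}(E_{b_i^+}, E_{A_i} | E_{b_i^-})}
{\mathbb{P}(E_{B_i \setminus \{b_i^+\}} | E_{b_i^+})\, \mathbb{P}(E_{b_i^+} | E_{b_i^-})} \\
&= \prod_{i=1}^{m} \mathbb{P}(E_{A_i} | E_{b_i^-}, E_{b_i^+}) .
\end{align*}

Proof of (26): Let $b := \max(B)$. We apply (25) to get

\begin{align*}
\mathbb{P}(X_A = x_A | X_B \in \widetilde{S}_B)
&= \frac{\sum_{x_B \in \widetilde{S}_B} \mathbb{P}(X_A = x_A, X_B = x_B)}{\mathbb{P}(X_B \in \widetilde{S}_B)} \\
&= \frac{\sum_{x_B \in \widetilde{S}_B} \mathbb{P}(X_A = x_A | X_b = x_b)\, \mathbb{P}(X_B = x_B)}{\mathbb{P}(X_B \in \widetilde{S}_B)} \\
&= \mathbb{P}(X_A = x_A | X_b = x_b) .
\end{align*}

Proof of (24): Let $n \in \mathbb{Z}$, $m \in \mathbb{N}$, $B := [n-m, n[$, $x_n \in S_n$, $x_B \in S_B$. Let
$C_+ := C \cap [n, \infty[$ and $C_- := C\, \cap\, ]-\infty, n[$. Thus $\widetilde{S}_C = \widetilde{S}_{C_-} \times \widetilde{S}_{C_+}$. We apply
(26) twice to show that $(X | X_C \in \widetilde{S}_C)$ fulfils (22) and is thus Markov:

\begin{align*}
&\mathbb{P}(X_n = x_n | X_B = x_B, X_C \in \widetilde{S}_C) \\
&\quad = \frac{\mathbb{P}(X_n = x_n, X_{C_+} \in \widetilde{S}_{C_+}, X_B = x_B, X_{C_-} \in \widetilde{S}_{C_-})}{\mathbb{P}(X_B = x_B, X_C \in \widetilde{S}_C)} \\
&\quad = \frac{\mathbb{P}(X_n = x_n, X_{C_+} \in \widetilde{S}_{C_+} | X_B = x_B, X_{C_-} \in \widetilde{S}_{C_-})\, \mathbb{P}(X_B = x_B, X_{C_-} \in \widetilde{S}_{C_-})}
{\mathbb{P}(X_{C_+} \in \widetilde{S}_{C_+} | X_B = x_B, X_{C_-} \in \widetilde{S}_{C_-})\, \mathbb{P}(X_B = x_B, X_{C_-} \in \widetilde{S}_{C_-})} \\
&\quad = \frac{\mathbb{P}(X_n = x_n, X_{C_+} \in \widetilde{S}_{C_+} | X_{n-1} = x_{n-1})}{\mathbb{P}(X_{C_+} \in \widetilde{S}_{C_+} | X_{n-1} = x_{n-1})} \\
&\quad = \mathbb{P}(X_n = x_n | X_{n-1} = x_{n-1}, X_{C_+} \in \widetilde{S}_{C_+}) \\
&\quad = \mathbb{P}(X_n = x_n | X_{n-1} = x_{n-1}, X_{C_+} \in \widetilde{S}_{C_+}, X_{C_-} \in \widetilde{S}_{C_-}) \\
&\quad = \mathbb{P}(X_n = x_n | X_{n-1} = x_{n-1}, X_C \in \widetilde{S}_C) .
\end{align*}

□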
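3.4 The loss case

We start with some derivations common to the proof of (11a) and (12a). We
assume $\mathcal{K} < \infty$. Equation (3) is equivalent to the existence of $x, \hat{x} \in \mathcal{X}$,
$\mathbf{y} \in \mathcal{Y}^{\mathcal{K}}$, $\mathbf{x} \in g^{-1}(\mathbf{y})$ with

\[ 0 < \mathbb{P}(X_0 = x, X_{[\mathcal{K}]} = \mathbf{x}, X_{\mathcal{K}+1} = \hat{x}) < \mathbb{P}(X_0 = x, Y_{[\mathcal{K}]} = \mathbf{y}, X_{\mathcal{K}+1} = \hat{x}) . \tag{27} \]

Let $\mathbf{s} := (x, \mathbf{x}, \hat{x})$. The unreconstructable set of trajectories $\mathcal{H}$ is

\[ \mathcal{H} := \{x\} \times g^{-1}(\mathbf{y}) \times \{\hat{x}\} . \tag{28} \]

Equation (3) implies that $\mathcal{H}$ contains at least two elements with positive
probability. If we pass through $\mathcal{H}$, we incur an entropy loss $L$:

\[ L := H(X_{[\mathcal{K}]} | X_{[0,\mathcal{K}+1]} \in \mathcal{H}) > 0 . \tag{29} \]

Let $\mathcal{I}$ be the random set of indices marking the start of non-overlapping runs of
$X_{[n]}$ through $\mathcal{H}$, that is

\[
\mathcal{I} := \left\{ i \in [n - \mathcal{K} - 1] :
\begin{array}{l}
X_{[i,i+\mathcal{K}+1]} \in \mathcal{H} \text{ and} \\
\forall\, j \in [i+1, i+\mathcal{K}+1] : X_{[j,j+\mathcal{K}+1]} \notin \mathcal{H}
\end{array}
\right\} , \tag{30}
\]

where we select lower indices greedily. For the $\mathbf{s}$ from after (27), we lower-bound
the tail probability of the cardinality of $\mathcal{I}$ by the one of $\mathcal{N}^X_{\mathbf{s}}(n)$:

\[ \forall\, m \in \mathbb{N} : \quad \mathbb{P}(|\mathcal{I}| \ge m) \ge \mathbb{P}(|\mathcal{N}^X_{\mathbf{s}}(n)| \ge m) . \tag{31} \]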

Finally, let 

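Proof of (11a). We claim that, for every $m \in \mathbb{N}$:

\[ H(X_{[n]}|Y_{[n]}) \ge \mathbb{P}(|\mathcal{I}| \ge m)\, H(X_{[n]}|Y_{[n]}, |\mathcal{I}| \ge m) \ge \mathbb{P}(|\mathcal{I}| \ge m)\, m L . \tag{33} \]

Combining (33) and (31), for $m = an$, with (19b), we arrive at (11a):

\begin{align*}
\overline{H}(X|Y) &= \lim_{n\to\infty} \frac{1}{n} H(X_{[n]}|Y_{[n]}) \\
&\ge \lim_{n\to\infty} \frac{1}{n} \mathbb{P}(|\mathcal{I}| \ge an)\, an L \ge aL \lim_{n\to\infty} \mathbb{P}(|\mathcal{N}^X_{\mathbf{s}}(n)| \ge an) = aL > 0 . \tag{34}
\end{align*}

It remains to prove (33). We fix $m, n \in \mathbb{N}$. For $I \subseteq [n]$ with $\mathbb{P}(\mathcal{I} = I) > 0$
and each $i \in I$, we derive the indices of the block $B_i := [i, i + \mathcal{K} + 1]$ and its
interior $\mathring{B}_i := [i+1, i+\mathcal{K}]$. Their unions are $B := \biguplus_{i \in I} B_i$ and $\mathring{B} := \biguplus_{i \in I} \mathring{B}_i$
respectively. Hence

\begin{align*}
H(X_{[n]}|Y_{[n]}, \mathcal{I} = I)
&\ge H(X_{\mathring{B}} | X_{[n] \setminus B}, \forall\, i \in I : X_{B_i} \in \mathcal{H}) , \tag{35a} \\
&= H(X_{\mathring{B}} | \forall\, i \in I : X_{B_i} \in \mathcal{H}) , \tag{35b}
\end{align*}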




Geiger & Temmel 



Lumpings of Markov chains and entropy rate loss 



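\begin{align*}
&= \sum_{i \in I} H(X_{\mathring{B}_i} | X_{B_i} \in \mathcal{H}) , \tag{35c} \\
&= |I| \times L , \tag{35d}
\end{align*}

where in (35a) we throw away all information outside $B$ and condition on it, in
(35b) we apply the conditional factorisation (23) twice to remove every condition
except the block ends, in (35c) we apply the conditional factorisation (23) to
the Markov process $(X | X_B \in \mathcal{H}^{|I|})$ and in (35d) we conclude by stationarity
and the minimum loss (29). Hence

\begin{align*}
H(X_{[n]}|Y_{[n]}, |\mathcal{I}| \ge m)
&= \sum_{I \subseteq [n],\, |I| \ge m} \mathbb{P}(\mathcal{I} = I \,|\, |\mathcal{I}| \ge m)\, H(X_{[n]}|Y_{[n]}, \mathcal{I} = I) \\
&\ge \sum_{I \subseteq [n],\, |I| \ge m} \mathbb{P}(\mathcal{I} = I \,|\, |\mathcal{I}| \ge m) \times |I| \times L \\
&\ge m L .
\end{align*}

□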

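Proof of (12a). For the $\mathbf{s}$ from after (27), we $\mathbb{P}$-almost surely have

\[ T_n \ge 2^{|\mathcal{N}^X_{\mathbf{s}}(n)|} . \tag{36} \]

Thus, (36) and (19a) imply that

\begin{align*}
\liminf_{n\to\infty} \sqrt[n]{T_n} &\ge \liminf_{n\to\infty} \exp\Big( (\log 2) \frac{1}{n} |\mathcal{N}^X_{\mathbf{s}}(n)| \Big) \\
&= \exp\Big( (\log 2) \liminf_{n\to\infty} \frac{1}{n} |\mathcal{N}^X_{\mathbf{s}}(n)| \Big) \ge \exp((\log 2)\, a) = 2^a > 1 . \tag{37}
\end{align*}

□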

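Proof of proposition 3. Let $x_0, x_{\mathcal{K}+1}, \mathbf{y}, \mathbf{x}', \mathbf{x}''$ be as in (3). Suppose that
$\mathcal{K} > K := \sum_{y \in \mathcal{Y}} |g^{-1}(y)| (|g^{-1}(y)| - 1)$ and $\mathcal{K} > 1$. We apply the pigeon-hole principle first
to every $x \in g^{-1}(y)$ and then to each $g^{-1}(y)$, for every $y \in \operatorname{supp} \mathbf{y}$. This ensures
that the two trajectories intersect:

\[ \exists\, m \in [\mathcal{K}] : \mathbf{x}'_{\{m\}} = \mathbf{x}''_{\{m\}} . \tag{38} \]

Choose $m$ fulfilling (38). If $m = 1$, then $\mathbf{x}'_{\{1\}}, x_{\mathcal{K}+1}, \mathbf{y}_{[2,\mathcal{K}]}, \mathbf{x}'_{[2,\mathcal{K}]}, \mathbf{x}''_{[2,\mathcal{K}]}$ fulfil
the conditions in (3). If $m > 1$, then $x_0, \mathbf{x}'_{\{m\}}, \mathbf{y}_{[m-1]}, \mathbf{x}'_{[m-1]}, \mathbf{x}''_{[m-1]}$ fulfil the
conditions in (3). Both cases lead to $\mathcal{K} < \mathcal{K}$, a contradiction. □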

4 Proofs of the sufficient conditions 

4.1 Single entry implies entropy rate preservation 

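Proof of proposition 6.

\begin{align*}
\overline{H}(X|Y) &= \lim_{n\to\infty} \frac{1}{n} H(X_{[n]}|Y_{[n]}) \\
&= \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} H(X_i | X_{[i-1]}, Y_{[n]}) \\
&\le \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} H(X_i | X_{[i-1]}, Y_i) \tag{39a} \\
&= \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} H(X_i | X_{i-1}, Y_i) \tag{39b} \\
&= H(X_1 | X_0, Y_1) ,
\end{align*}

where (39a) is due to $Y_{[i-1]} = g(X_{[i-1]})$ and because conditioning reduces
entropy and (39b) is by Markovity of $X$ from (24). We continue

\[
H(X_1|X_0, Y_1) = - \sum_{x_0, x_1 \in \mathcal{X}} \mathbb{P}(X_0 = x_0, X_1 = x_1) \operatorname{ld} \frac{\mathbb{P}(X_1 = x_1 | X_0 = x_0)}{\mathbb{P}(Y_1 = g(x_1) | X_0 = x_0)} = 0 ,
\]

because $(P,g)$ is SE. Thus $\mathbb{P}(X_1 = x_1 | X_0 = x_0) = \mathbb{P}(Y_1 = y | X_0 = x_0)$, if
$x_1 = x'(x_0, y)$, and zero otherwise. □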

4.2 Single forward property implies k-lumpability

For (conditional) probabilities we use the following short-hand notation:

\[ \mathbb{P}(Z = z) =: p_Z(z) \quad\text{and}\quad \mathbb{P}(Z_1 = z_1 | Z_2 = z_2) =: p_{Z_1|Z_2}(z_1|z_2) , \]

where we always assume that the latter is well-defined, i.e., that $p_{Z_2}(z_2) > 0$.

We start with some auxiliary results: 
Proposition 13.

\[ Y \text{ is HMC(k)} \iff \overline{H}(Y) = H(Y_k | Y_{[0,k[}) . \tag{40} \]

Proof.

\begin{align*}
0 &= H(Y_k|Y_{[0,k[}) - \overline{H}(Y) \\
&= \lim_{n\to\infty} H(Y_k|Y_{[0,k[}) - H(Y_n|Y_{[0,n[}) \\
&= \lim_{n\to\infty} H(Y_n|Y_{[n-k,n[}) - H(Y_n|Y_{[0,n[}) \\
&= \lim_{n\to\infty} I(Y_n; Y_{[0,n-k[} | Y_{[n-k,n[}) ,
\end{align*}

where the last quantity is the conditional mutual information between $Y_n$ and
$Y_{[0,n-k[}$ conditioned on $Y_{[n-k,n[}$. By stationarity, this sequence increases
monotonically in $n$, thus, setting it to zero implies [Gra90, Lemma 3.15, pp. 88] that
for all $n \in \mathbb{N}$:

\begin{align*}
p_{Y_n|Y_{[n-k,n[}}(\cdot|\mathbf{y})\, p_{Y_{[0,n-k[}|Y_{[n-k,n[}}(\cdot|\mathbf{y})
&= p_{Y_n, Y_{[0,n-k[}|Y_{[n-k,n[}}(\cdot,\cdot|\mathbf{y}) \\
&= p_{Y_n|Y_{[0,n[}}(\cdot|\cdot,\mathbf{y})\, p_{Y_{[0,n-k[}|Y_{[n-k,n[}}(\cdot|\mathbf{y}) ,
\end{align*}

where the first equality holds $\mathbb{P}_{Y_{[n-k,n[}}$-a.s. The equality between the first and
last line is equivalent to the Markov property in definition 8. □

Lemma 14 ([CT06, Thm. 4.5.1, pp. 86]). If $X$ is a stationary Markov chain
and $Y$ a stationary process defined by $Y_n := g(X_n)$, then the entropy rate of $Y$
is bounded by

\[ \forall\, k \in \mathbb{N} : \quad H(Y_k | Y_{[k-1]}, X_0) \le \overline{H}(Y) \le H(Y_k | Y_{[0,k[}) . \]

A direct solution for the entropy rate of the lumped process $Y$ was shown
to be intrinsically complicated in [Bla57]. However, both the upper and lower
bound are asymptotically tight [CT06, Thm. 4.5.1, pp. 86].

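Proposition 15. The following facts are equivalent:

\begin{align*}
p_{Y_k|Y_{[k-1]},X_0}(y|\mathbf{y},x) &= p_{Y_k|Y_{[k-1]},Y_0}(y|\mathbf{y},g(x)) , \tag{41a} \\
H(Y_k|Y_{[k-1]},X_0) &= H(Y_k|Y_{[0,k[}) . \tag{41b}
\end{align*}

They are sufficient for k-lumpability of $(P,g)$, but necessary only for strongly
k-lumpable $(P,g)$.

Proof. Statement (41b), together with proposition 13 and lemma 14, implies
the k-lumpability of $(P,g)$.

Equivalence between (41b) and (41a):

\begin{align*}
0 &= H(Y_k|Y_{[0,k[}) - H(Y_k|Y_{[k-1]},X_0) \\
&= H(Y_k|Y_{[0,k[}) - H(Y_k|Y_{[0,k[},X_0) \\
&= I(Y_k; X_0 | Y_{[0,k[}) ,
\end{align*}

which is equivalent to

\[ p_{Y_k,X_0|Y_{[0,k[}}(\cdot,\cdot|\mathbf{y}) = p_{Y_k|Y_{[0,k[}}(\cdot|\mathbf{y})\, p_{X_0|Y_{[0,k[}}(\cdot|\mathbf{y}) \]

and thus

\[ p_{Y_k|Y_{[0,k[},X_0}(\cdot|\mathbf{y},x) = p_{Y_k|Y_{[0,k[}}(\cdot|\mathbf{y}) , \]

for all $x, \mathbf{y}$ such that $p_{X_0,Y_{[0,k[}}(x,\mathbf{y}) > 0$.

The proof that strong k-lumpability implies (41) follows the one in [KS76,
Thm. 6.3.2]: Strong lumpability requires that $p_{Y_k|Y_{[0,k[}}(y|\mathbf{y})$ is independent of
the distribution of $X_0$, thus in particular, it needs to have the same value for
all distributions placing unit mass on a state $x' \in g^{-1}(\mathbf{y}_{\{0\}})$. We have

\[ p_{Y_k|Y_{[0,k[}}(y|\mathbf{y}) = \sum_{x \in g^{-1}(\mathbf{y}_{\{0\}})} p_{Y_k|Y_{[0,k[},X_0}(y|\mathbf{y},x)\, p_{X_0|Y_{[0,k[}}(x|\mathbf{y}) . \tag{42} \]

Since

\[ p_{X_0|Y_{[0,k[}}(x|\mathbf{y}) = \frac{p_{X_0}(x)\, p_{Y_{[0,k[}|X_0}(\mathbf{y}|x)}{p_{Y_{[0,k[}}(\mathbf{y})} \]

and since

\[ \sum_{x \in g^{-1}(\mathbf{y}_{\{0\}})} p_{X_0|Y_{[0,k[}}(x|\mathbf{y}) = 1 , \]

it follows that $p_{Y_{[0,k[}|X_0}(\mathbf{y}|x') = p_{Y_{[0,k[}}(\mathbf{y})$ ($p_{X_0}(x) = 0$, for all $x \in g^{-1}(\mathbf{y}_{\{0\}}) \setminus \{x'\}$).
The sum in (42) degenerates and we obtain

\[ p_{Y_k|Y_{[0,k[}}(y|\mathbf{y}) = p_{Y_k|Y_{[0,k[},X_0}(y|\mathbf{y},x') . \]

Strong lumpability implies that $p_{Y_k|Y_{[0,k[}}(y|\mathbf{y})$ is independent of the distribution
of $X_0$, whence (41) follows. □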

We are currently not sure if (41) and strong k-lumpability are equivalent.

Example 16 (taken from [KS76, pp. 139]). Consider the following transition
matrix, where the lines divide lumped states:

\[
P := \left( \begin{array}{c|ccc}
1/4 & 1/16 & 3/16 & 1/2 \\ \hline
0 & 1/12 & 1/12 & 5/6 \\
0 & 1/12 & 1/12 & 5/6 \\
7/8 & 1/32 & 3/32 & 0
\end{array} \right) .
\]

This lumping (and its time-reversal) is 1-lumpable [KS76, pp. 139]. However,
we have (with an accuracy of 0.0001)

\[ 0.5588 = H(Y_1|X_0) < \overline{H}(Y) = H(Y_1|Y_0) = 0.9061 \]

and

\[ 0.9048 = H(Y_0|X_1) < \overline{H}(Y) = H(Y_0|Y_1) = 0.9061 . \]

Hence k-lump does not imply (41).
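The forward pair of numbers in example 16 can be reproduced numerically. The sketch below is an illustration under assumptions, not part of the paper: it takes the matrix with zeros in the positions forced by the row sums, assumes that the lumping merges the last three states and keeps the first state alone, and approximates the invariant measure by power iteration. All function and variable names are hypothetical.

```python
from math import log2

P = [
    [1 / 4, 1 / 16, 3 / 16, 1 / 2],
    [0, 1 / 12, 1 / 12, 5 / 6],
    [0, 1 / 12, 1 / 12, 5 / 6],
    [7 / 8, 1 / 32, 3 / 32, 0],
]
g = [0, 1, 1, 1]  # assumed lumping: first state alone, remaining three merged


def entropy(pmf):
    return -sum(p * log2(p) for p in pmf if p > 0)


def stationary(P, iters=1000):
    n = len(P)
    mu = [1.0 / n] * n
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    return mu


def one_step_bounds(P, g):
    """Return (H(Y_1 | X_0), H(Y_1 | Y_0)) for the stationary chain."""
    m = max(g) + 1
    mu = stationary(P)
    # Law of Y_1 given X_0 = x: sum the row of P over each lumped state.
    rows = []
    for row in P:
        lumped = [0.0] * m
        for x2, p in enumerate(row):
            lumped[g[x2]] += p
        rows.append(lumped)
    lower = sum(mu[x] * entropy(rows[x]) for x in range(len(P)))
    # Joint law of (Y_0, Y_1).
    joint = [[0.0] * m for _ in range(m)]
    for x in range(len(P)):
        for y1 in range(m):
            joint[g[x]][y1] += mu[x] * rows[x][y1]
    upper = sum(
        sum(r) * entropy([p / sum(r) for p in r]) for r in joint if sum(r) > 0
    )
    return lower, upper


lower, upper = one_step_bounds(P, g)
print(round(lower, 4), round(upper, 4))  # 0.5588 0.9061
```

Under these assumptions the two values agree with the forward bounds quoted in the example, which illustrates the strict gap between the lower bound of lemma 14 and the entropy rate for this 1-lumpable chain.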

Proof of proposition 10. We first show that SE contains SFS(k), which implies
preservation of entropy. We have

\[
p_{X_{[k-1]}|Y_{[k-1]},X_0}(\mathbf{x}|\mathbf{y}_{[k-1]},x_0) =
\begin{cases}
1 & \text{if } \mathbf{x} = \mathbf{x}'(\mathbf{y}_{[0,k-1]}) \text{ from (10)} , \\
0 & \text{else.}
\end{cases}
\]

If SE does not hold, then there exist states $y^* \in \mathcal{Y}$ and $x^* \in \mathcal{X}$ such that at
least two states in $g^{-1}(y^*)$ have positive transition probabilities from $x^*$.
Assume that $\mathbf{y}_{\{k-2,k-1\}} = (g(x^*), y^*)$. Thus, observing $\mathbf{y}_{[0,k[}$ does not determine $\mathbf{x}$.

Second, we show that SFS(k) implies k-lumpability of $(P,g)$. By
proposition 15, it suffices to show that

\[ p_{Y_k|Y_{[k-1]},X_0}(\mathbf{y}_{\{k\}}|\mathbf{y}_{[k-1]},x_0) = p_{Y_k|Y_{[0,k[}}(\mathbf{y}_{\{k\}}|\mathbf{y}_{[0,k[}) \]

holds $\mathbb{P}$-almost surely. Knowing $Y_{[0,k[} = \mathbf{y}_{[0,k[}$, however, determines $X_{[k-1]}$
regardless of $x_0 \in g^{-1}(\mathbf{y}_{\{0\}})$. The Markov property implies

\begin{align*}
p_{Y_k|Y_{[k-1]},X_0}(\mathbf{y}_{\{k\}}|\mathbf{y}_{[k-1]},x_0) &= p_{Y_k|Y_{[0,k[}}(\mathbf{y}_{\{k\}}|\mathbf{y}_{[0,k[}) \\
&= p_{Y_k|X_{k-1}}(\mathbf{y}_{\{k\}}|\mathbf{x}'(\mathbf{y}_{[0,k[})_{\{k-1\}}) .
\end{align*}

Moreover, the SE property implies that, for every $x \in \mathcal{X}$, $y \in \mathcal{Y}$,

\[ p_{Y_k|X_{k-1}}(y|x) = p_{X_k|X_{k-1}}(x'(x,y)|x) , \]

where $x'(x,y)$ is from (8). This yields another proof for entropy rate preservation
by equating the outer terms of the following chain of inequalities (the first
inequality is due to lemma 14, the second due to data processing [GK11, WA60]):

\[ H(Y_k|Y_{[k-1]},X_0) \le \overline{H}(Y) \le \overline{H}(X) = H(X_k|X_{k-1}) . \]

□ 